--- layout: null title: "NYC Collision Data Exploration" tags: [R, Data, Analysis, Exploratory, RStudio, NYC, Collision, Vehicle] ---
This notebook explores the more recent data from NYC Open Data Data Set.
This dataset can also be reached and interacted with through its Google BigQuery location
Try a different version: handout page, Tab Navigation
Reduce Car Accidents in Brooklyn
For this exercise, we’d like you to analyze data on New York motor vehicle collisions and answer the following question:
What are your ideas for reducing accidents in Brooklyn?
Imagine you are preparing this presentation for the city council who will use it to inform new legislation and/or projects.
Briefly:
Libraries that will be used during exploration
library(magrittr)
library(dplyr)
library(ggplot2)
library(viridis)
library(plotly)
library(maps)
library(rgeos)
library(rgdal)
library(ggthemes)
library(crosstalk)
library(leaflet)
library(d3scatter)
library(d3heatmap)
library(rnoaa)
Collect API tokens in Environment Variables (purposefully kept hidden here). Tokens and keys used include Google Maps API key (get one here, Mapbox Access Token(get one here) and an NCDC token (here) for NOAA weather data.
Load data from /data directory and into memory
dt <- read.csv(file = "data/NYPD_Motor_Vehicle_Collisions.csv")
Inspect structure of dataset with the str() command:
str(dt)
## 'data.frame': 990800 obs. of 29 variables:
## $ DATE : Factor w/ 1709 levels "01/01/2013","01/01/2014",..: 200 200 200 200 1068 920 611 200 65 65 ...
## $ TIME : Factor w/ 1440 levels "0:00","0:01",..: 556 631 632 644 661 936 46 686 1051 1066 ...
## $ BOROUGH : Factor w/ 6 levels "","BRONX","BROOKLYN",..: 1 2 2 3 1 1 1 1 3 3 ...
## $ ZIP.CODE : int NA 10454 10466 11218 NA NA NA NA 11218 11236 ...
## $ LATITUDE : num 40.7 40.8 40.9 40.6 40.7 ...
## $ LONGITUDE : num -73.9 -73.9 -73.9 -74 -73.9 ...
## $ LOCATION : Factor w/ 90272 levels "","(0.0, 0.0)",..: 28545 73165 89328 17808 52600 1 1 14824 17763 18379 ...
## $ ON.STREET.NAME : Factor w/ 9151 levels "","?EST 125 STREET",..: 1420 1 3493 1 1 1 6598 4069 738 7071 ...
## $ CROSS.STREET.NAME : Factor w/ 9585 levels "","0","01247",..: 1 1 9364 1 1 1 7399 3638 114 4201 ...
## $ OFF.STREET.NAME : Factor w/ 59908 levels "","(26 BROOKLYN TERMINAL MARKET LOT)",..: 1 38225 1 29898 1 1 1 1 1 1 ...
## $ NUMBER.OF.PERSONS.INJURED : int 0 0 1 0 0 0 0 0 1 2 ...
## $ NUMBER.OF.PERSONS.KILLED : int 0 0 0 0 0 0 0 0 0 0 ...
## $ NUMBER.OF.PEDESTRIANS.INJURED: int 0 0 1 0 0 0 0 0 0 0 ...
## $ NUMBER.OF.PEDESTRIANS.KILLED : int 0 0 0 0 0 0 0 0 0 0 ...
## $ NUMBER.OF.CYCLIST.INJURED : int 0 0 0 0 0 0 0 0 0 0 ...
## $ NUMBER.OF.CYCLIST.KILLED : int 0 0 0 0 0 0 0 0 0 0 ...
## $ NUMBER.OF.MOTORIST.INJURED : int 0 0 0 0 0 0 0 0 1 2 ...
## $ NUMBER.OF.MOTORIST.KILLED : int 0 0 0 0 0 0 0 0 0 0 ...
## $ CONTRIBUTING.FACTOR.VEHICLE.1: Factor w/ 49 levels "","Accelerator Defective",..: 10 47 47 47 47 11 1 43 43 43 ...
## $ CONTRIBUTING.FACTOR.VEHICLE.2: Factor w/ 49 levels "","Accelerator Defective",..: 47 1 1 47 47 47 1 47 47 47 ...
## $ CONTRIBUTING.FACTOR.VEHICLE.3: Factor w/ 43 levels "","Accelerator Defective",..: 1 1 1 1 1 1 1 1 42 1 ...
## $ CONTRIBUTING.FACTOR.VEHICLE.4: Factor w/ 42 levels "","Accelerator Defective",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ CONTRIBUTING.FACTOR.VEHICLE.5: Factor w/ 31 levels "","Aggressive Driving/Road Rage",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ UNIQUE.KEY : int 3612721 3612791 3618743 3614471 3284922 2833714 336679 3618925 3598095 3597360 ...
## $ VEHICLE.TYPE.CODE.1 : Factor w/ 18 levels "","AMBULANCE",..: 15 10 12 15 10 10 1 10 10 15 ...
## $ VEHICLE.TYPE.CODE.2 : Factor w/ 18 levels "","AMBULANCE",..: 10 1 1 10 16 10 1 10 10 15 ...
## $ VEHICLE.TYPE.CODE.3 : Factor w/ 18 levels "","AMBULANCE",..: 1 1 1 1 1 1 1 1 15 1 ...
## $ VEHICLE.TYPE.CODE.4 : Factor w/ 18 levels "","AMBULANCE",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ VEHICLE.TYPE.CODE.5 : Factor w/ 16 levels "","AMBULANCE",..: 1 1 1 1 1 1 1 1 1 1 ...
Inspect summary of dataset with summary() command:
summary(dt)
## DATE TIME BOROUGH
## 01/21/2014: 1161 16:00 : 12792 :260725
## 01/18/2015: 960 15:00 : 12748 BRONX : 95396
## 02/03/2014: 960 17:00 : 12597 BROOKLYN :223552
## 03/06/2015: 936 18:00 : 11641 MANHATTAN :187571
## 01/07/2017: 887 14:00 : 11094 QUEENS :189619
## 09/30/2016: 872 13:00 : 10365 STATEN ISLAND: 33937
## (Other) :985024 (Other):919563
## ZIP.CODE LATITUDE LONGITUDE
## Min. :10000 Min. : 0.00 Min. :-201.36
## 1st Qu.:10075 1st Qu.:40.67 1st Qu.: -73.98
## Median :11205 Median :40.72 Median : -73.93
## Mean :10808 Mean :40.72 Mean : -73.92
## 3rd Qu.:11236 3rd Qu.:40.77 3rd Qu.: -73.87
## Max. :11697 Max. :40.91 Max. : 0.00
## NA's :260826 NA's :201443 NA's :201443
## LOCATION ON.STREET.NAME
## :201443 :188246
## (40.6960346, -73.9845292): 673 BROADWAY : 10832
## (40.7606005, -73.9643142): 544 ATLANTIC AVENUE : 9354
## (40.7572323, -73.9897922): 485 NORTHERN BOULEVARD: 7490
## (40.6757357, -73.8968533): 480 3 AVENUE : 6864
## (40.6585778, -73.8906229): 464 FLATBUSH AVENUE : 6500
## (Other) :786711 (Other) :761514
## CROSS.STREET.NAME OFF.STREET.NAME
## :217648 :916764
## 3 AVENUE: 11407 PARKING LOT 110-00 ROCKAWAY BOULEVARD : 150
## BROADWAY: 11088 PARKING LOT-772 EDGEWATER RD : 91
## 2 AVENUE: 9678 PARKING LOT OF 110-00 ROCKAWAY BOULEVARD: 90
## 5 AVENUE: 7846 3 AVENUE : 72
## 7 AVENUE: 7312 2 AVENUE : 67
## (Other) :725821 (Other) : 73566
## NUMBER.OF.PERSONS.INJURED NUMBER.OF.PERSONS.KILLED
## Min. : 0.0000 Min. :0.000000
## 1st Qu.: 0.0000 1st Qu.:0.000000
## Median : 0.0000 Median :0.000000
## Mean : 0.2552 Mean :0.001214
## 3rd Qu.: 0.0000 3rd Qu.:0.000000
## Max. :43.0000 Max. :5.000000
##
## NUMBER.OF.PEDESTRIANS.INJURED NUMBER.OF.PEDESTRIANS.KILLED
## Min. : 0.00000 Min. :0.0000000
## 1st Qu.: 0.00000 1st Qu.:0.0000000
## Median : 0.00000 Median :0.0000000
## Mean : 0.05455 Mean :0.0006833
## 3rd Qu.: 0.00000 3rd Qu.:0.0000000
## Max. :15.00000 Max. :2.0000000
##
## NUMBER.OF.CYCLIST.INJURED NUMBER.OF.CYCLIST.KILLED
## Min. :0.00000 Min. :0.00e+00
## 1st Qu.:0.00000 1st Qu.:0.00e+00
## Median :0.00000 Median :0.00e+00
## Mean :0.02093 Mean :7.47e-05
## 3rd Qu.:0.00000 3rd Qu.:0.00e+00
## Max. :6.00000 Max. :1.00e+00
##
## NUMBER.OF.MOTORIST.INJURED NUMBER.OF.MOTORIST.KILLED
## Min. : 0.0000 Min. :0.000000
## 1st Qu.: 0.0000 1st Qu.:0.000000
## Median : 0.0000 Median :0.000000
## Mean : 0.1927 Mean :0.000463
## 3rd Qu.: 0.0000 3rd Qu.:0.000000
## Max. :43.0000 Max. :5.000000
##
## CONTRIBUTING.FACTOR.VEHICLE.1
## Unspecified :523736
## Driver Inattention/Distraction:127688
## Fatigued/Drowsy : 48249
## Failure to Yield Right-of-Way : 42948
## Other Vehicular : 30393
## Backing Unsafely : 27886
## (Other) :189900
## CONTRIBUTING.FACTOR.VEHICLE.2
## Unspecified :738985
## :123724
## Driver Inattention/Distraction: 37843
## Other Vehicular : 17711
## Fatigued/Drowsy : 13016
## Failure to Yield Right-of-Way : 9087
## (Other) : 50434
## CONTRIBUTING.FACTOR.VEHICLE.3
## :925696
## Unspecified : 59537
## Other Vehicular : 1225
## Fatigued/Drowsy : 1122
## Driver Inattention/Distraction: 1100
## Pavement Slippery : 234
## (Other) : 1886
## CONTRIBUTING.FACTOR.VEHICLE.4
## :976717
## Unspecified : 12938
## Fatigued/Drowsy : 222
## Other Vehicular : 221
## Driver Inattention/Distraction: 192
## Pavement Slippery : 67
## (Other) : 443
## CONTRIBUTING.FACTOR.VEHICLE.5 UNIQUE.KEY
## :987360 Min. : 22
## Unspecified : 3186 1st Qu.: 249509
## Other Vehicular : 52 Median :3131520
## Fatigued/Drowsy : 48 Mean :2054070
## Driver Inattention/Distraction: 36 3rd Qu.:3379220
## Pavement Slippery : 23 Max. :3627969
## (Other) : 95
## VEHICLE.TYPE.CODE.1
## PASSENGER VEHICLE :579372
## SPORT UTILITY / STATION WAGON:218537
## TAXI : 37190
## VAN : 26511
## OTHER : 24699
## UNKNOWN : 20713
## (Other) : 83778
## VEHICLE.TYPE.CODE.2
## PASSENGER VEHICLE :438701
## SPORT UTILITY / STATION WAGON:165455
## :134997
## UNKNOWN : 80864
## TAXI : 31205
## OTHER : 25249
## (Other) :114329
## VEHICLE.TYPE.CODE.3
## :926878
## PASSENGER VEHICLE : 38181
## SPORT UTILITY / STATION WAGON: 15761
## UNKNOWN : 3240
## VAN : 1401
## TAXI : 1163
## (Other) : 4176
## VEHICLE.TYPE.CODE.4
## :977085
## PASSENGER VEHICLE : 8441
## SPORT UTILITY / STATION WAGON: 3553
## UNKNOWN : 583
## VAN : 248
## OTHER : 205
## (Other) : 685
## VEHICLE.TYPE.CODE.5
## :987436
## PASSENGER VEHICLE : 2072
## SPORT UTILITY / STATION WAGON: 958
## UNKNOWN : 94
## OTHER : 52
## VAN : 50
## (Other) : 138
Our Dataset structure revealed the variables and their classes sapply(names(dt), function(x) paste0(x, ' is class: ', class(dt[[x]])))
The first thing to come to mind with such a factor heavy dataset is counting. Factors are not a hodgepodge collection of values, observed as they are pick up off the ground. Each Level of a factor - ideally - should have been intentionally designed. Ordered and distributed according to a purpose greater than the unit. Though not statically related as quantifiable assets, the levels in a factor are each related to one and other in addition to the group as a whole.
There are a lot of empty cells. To make sure we use a universal value for blank or Not Available we will assign the value NA to all blank cells. While munging around with the data, add what could be a valuable variable created from two current variables. The DATE and TIME variables are set up as factors. This has interesting categorical value so they will stay in the data. Rather than replace the two rows, add a third one in a POSIX date form.
dt[dt == ''] <- NA
# create date.time column for time use
#dt$date.time <- as.Date(dt$DATE, format("%m/%d/%Y"))
# use hours & min in date.time var for calculations
dt <- within(dt, {date.time = as.POSIXct(strptime(paste(DATE, TIME), "%m/%d/%Y %H:%M"))})
With Latitude and Longitude present and appearing to be fairly well documented, let’s take a quick look at how these accidents look over an interactive world map (incase of mistakes outlying somewhere aside from New York). We will use the BOROUGH variable as a factor. This gives the geographic association of each borough and allows us early forsight into anything specific about our point of interest BOURGH == "BROOKLYN"
mp <- dt %>%
plot_mapbox(lat = ~LATITUDE, lon = ~LONGITUDE,
split = ~BOROUGH, mode = 'scattermapbox')
plotly_build(mp)